Abstract. In multi-label learning, each sample is associated with several labels.
Existing works indicate that exploring correlations between labels improves the
prediction performance. However, embedding the label correlations into the
training process significantly increases the problem size. Moreover, the mapping
of the label structure in the feature space is not clear. In this paper, we
propose a novel multi-label learning method "Structured Decomposition + Group
Sparsity (SDGS)". In SDGS, we learn a feature subspace for each label from the
structured decomposition of the training data, and predict the labels of a new
sample from its group sparse representation on the multi-subspace obtained from
the structured decomposition. In particular, in the training stage, we decompose
the data matrix $X \in \mathbb{R}^{n \times p}$ as $X = \sum_{i=1}^{k} L^i + S$,
wherein the rows of $L^i$ associated with samples that belong to label $i$ are
nonzero and constitute a low-rank matrix, while the other rows are all-zeros, and
the residual $S$ is a sparse matrix. The row space of $L^i$ is the feature
subspace corresponding to label $i$. This decomposition can be efficiently
obtained via randomized optimization. In the prediction stage, we estimate the
group sparse representation of a new sample on the multi-subspace via group
lasso. The nonzero representation coefficients tend to concentrate on the
subspaces of labels that the sample belongs to, and thus an effective prediction
can be obtained. We evaluate SDGS on several real datasets and compare it with
popular methods. Results verify the effectiveness and efficiency of SDGS.



1 Introduction 

Multi-label learning [1] aims to find a mapping from the feature space
$\mathcal{X} \subseteq \mathbb{R}^p$ to the label vector space
$\mathcal{Y} \subseteq \{0, 1\}^k$, wherein $k$ is the number of labels and
$y_i = 1$ denotes that the sample belongs to label $i$. Binary relevance (BR) [2]
and label powerset (LP) [2] are two early and natural solutions. BR and LP
transform a multi-label learning problem into several binary classification tasks
and a single-label classification task, respectively. Specifically, BR associates
each label with an individual class, i.e., assigns samples with the same label to
the same class. LP treats each unique set of labels as a class, in which samples
share the same label vector.
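
As a minimal sketch of the two transformations just described, assuming a small hand-made label matrix `Y` (the helper names are illustrative and not taken from [2]), the BR and LP targets could be derived as follows:

```python
import numpy as np

def binary_relevance_targets(Y):
    """BR: one binary target vector per label (the columns of Y)."""
    return [Y[:, i] for i in range(Y.shape[1])]

def label_powerset_targets(Y):
    """LP: map each distinct label vector (row of Y) to a single class id."""
    patterns, class_ids = np.unique(Y, axis=0, return_inverse=True)
    return patterns, class_ids.ravel()

# Toy label matrix: 4 samples, 3 labels (hypothetical data).
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1],
              [1, 1, 0]])

br_targets = binary_relevance_targets(Y)           # 3 binary problems
patterns, lp_classes = label_powerset_targets(Y)   # 1 problem with 3 classes
```

Samples 1 and 3 share the label vector (1, 0, 1), so LP places them in the same class, while BR simply produces one binary task per column.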

Although BR/LP and their variants can directly transform a multi-label learning
problem into multiple binary classification tasks or a single-label
classification task, multi-label learning brings new problems. First, the labels
are not mutually exclusive in multi-label learning, and thus it is necessary to
consider not only the discriminative information between different labels but
also their correlations. Second, the large number of labels often leads to an
imbalance between positive and negative samples in each class, which limits the
performance of binary classification algorithms. Third, the problem size of
multi-label learning increases significantly when it is decomposed into many
binary classification problems.

Recent multi-label learning methods tackle some of the above problems and
demonstrate that the prediction performance can be improved by exploiting
specific properties of the multi-label data, e.g., label dependence, label
structure, and the dependence between samples and the corresponding labels. We
categorize popular methods into two groups.

1. The first group of methods transforms multi-label prediction into a sequence
of binary classification problems with special structures implied by label
correlations. For example, the random k-labelsets (RAkEL) method [3] randomly
selects an ensemble of subsets from the original labelsets, and then LP is
applied to each subset. The final prediction is obtained by ranking and
thresholding the results on the subsets. Hierarchical binary relevance (HBR) [4]
builds a general-to-specific tree structure of labels, where a sample with a
label must be associated with its parent labels. A binary classifier is trained
on each non-root label. Hierarchy of multi-label classifiers (HOMER) [5]
recursively partitions the labels into several subsets and builds a tree-shaped
hierarchy. A binary classifier is trained on each non-root label subset. The
classifier chain (CC) [6] greedily predicts each unknown label from the features
and the previously predicted labels via a binary classifier.

2. The second group of methods formulates multi-label prediction as other kinds
of problems rather than binary classification. For example, the C&W procedure [7]
separates the problem into two stages, i.e., BR and correction of the BR results
by using label dependence. Regularized multi-task learning [8] and
shared-subspace learning [9] formulate the problem as a regularized regression or
classification problem. Multi-label k-nearest neighbor (ML-kNN) [10] is an
extension of kNN. Multi-label dimensionality reduction via dependence
maximization (MDDM) [11] maximizes the dependence between feature space and label
space, and provides a data preprocessing step for other multi-label learning
methods. A linear dimensionality reduction method for multi-label data is
proposed in [12]. In [13], multi-label prediction is formulated as a sparse
signal recovery problem.

However, the problem size always increases significantly when multi-label
learning is decomposed into a set of binary classification problems or formulated
as another existing problem, because the label correlations need to be
additionally considered. Furthermore, the mapping of label structure in feature
space has not been studied. In this paper, we propose a novel multi-label
learning method "Structured Decomposition + Group Sparsity (SDGS)", which assigns
each label a corresponding feature subspace via randomized decomposition of the
training data, and predicts the labels of a new sample by estimating its group
sparse representation in the obtained multi-subspace.

In the training stage, SDGS approximately decomposes the data matrix
$X \in \mathbb{R}^{n \times p}$ (each row is a training sample) as
$X = \sum_{i=1}^{k} L^i + S$. In the matrix $L^i$, only the rows corresponding to
samples with label $i$ (i.e., $y_i = 1$) are nonzero. These rows represent the
components determined by label $i$ in the samples and compose a low-rank matrix,
whose row space is the feature subspace characterized by label $i$. The matrix
$S$ represents the residual components that cannot be explained by the given
labels and is constrained to be sparse. The decomposition is obtained via a
randomized optimization with low time complexity.

In the prediction stage, SDGS estimates the group sparse representation of a new
sample in the obtained multi-subspace via group lasso [14]. The representation
coefficients associated with basis vectors in the same subspace are in the same
group. Since the components caused by a specific label can be linearly
represented by the corresponding subspace obtained in the training stage, the
nonzero representation coefficients will concentrate on the groups corresponding
to the labels that the sample belongs to. This gives the rationale of the
proposed SDGS for multi-label learning. Group lasso is able to select these
nonzero coefficients group-wise, and thus the labels can be identified.

SDGS provides a novel and natural multi-label learning method by building a map- 
ping of label structure in decomposed feature subspaces. Group sparse representation in 
the multi-subspace is applied to recover the unknown labels. It embeds the label corre- 
lations without increasing the problem size and is robust to the imbalance problem. By 
comparing SDGS with different multi-label learning methods, we show its effectiveness 
and efficiency on several datasets. 



2 Assumption and Motivation 

Given a sample $x \in \mathbb{R}^p$ and its label vector $y \in \{0, 1\}^k$, we
assume that $x$ can be decomposed as the sum of several components $l^i$ and a
sparse residual $s$:

$$x = \sum_{i:\, y_i = 1} l^i + s. \tag{1}$$

The component $l^i$ is caused by the label $i$ that $x$ belongs to. Thus $l^i$
can be explained as the mapping of label $i$ in $x$. The residual $s$ is the
component that all the labels in $y$ cannot explain. The model in (1) reveals the
general relationship between feature space and labels.

For all the samples with label $i$, we assume their components corresponding to
label $i$ lie in a linear subspace $C^i \in \mathbb{R}^{r^i \times p}$, i.e.,
$l^i = \beta_{G_i} C^i$, wherein $\beta_{G_i}$ are the representation
coefficients corresponding to $C^i$. Thus the model (1) can be equivalently
written as:

$$x = \sum_{i=1}^{k} \beta_{G_i} C^i + s, \quad \forall i \in \{i : y_i = 0\},\ \beta_{G_i} = 0. \tag{2}$$

If we build a dictionary $C = [C^1; \ldots; C^k]$ as the multi-subspace
characterized by the $k$ labels, the corresponding representation coefficient
vector for $x$ is $\beta = [\beta_{G_1}, \ldots, \beta_{G_k}]$. The coefficients
$\beta_{G_i}$ corresponding to the labels $x$ does not belong to are zeros, so
$\beta$ is group sparse, wherein the groups are $G_i$, $i = 1, \ldots, k$.
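
To make the model in (2) concrete, the following sketch generates one synthetic sample from it. All sizes, the random subspaces and the variable names are assumptions chosen for illustration; they are not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 20, 4                 # feature dimension and number of labels (toy sizes)
ranks = [2, 3, 2, 1]         # r^i: dimension of the subspace of each label

# C^i: r^i orthonormal rows spanning the subspace of label i.
C_blocks = [np.linalg.qr(rng.standard_normal((p, r)))[0].T for r in ranks]
C = np.vstack(C_blocks)      # dictionary C = [C^1; ...; C^k]

# Group G_i: indices of the coefficients that multiply the rows of C^i.
offsets = np.concatenate(([0], np.cumsum(ranks)))
groups = [np.arange(offsets[i], offsets[i + 1]) for i in range(k)]

y = np.array([1, 0, 1, 0])   # the sample carries the first and third labels
beta = np.zeros(C.shape[0])
for i in np.flatnonzero(y):
    beta[groups[i]] = rng.standard_normal(ranks[i])

s = np.zeros(p)
s[rng.choice(p, size=2, replace=False)] = 0.1    # sparse residual
x = beta @ C + s                                 # model (2)

group_norms = np.array([np.linalg.norm(beta[g]) for g in groups])
```

Here `group_norms` is nonzero exactly on the groups of the labels in `y`, which is the group sparsity that the prediction stage later tries to recover from `x` alone.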

In the training stage of SDGS, we learn the multi-subspace $C^i$,
$i = 1, \ldots, k$, from the training data via a structured decomposition, in
which the components corresponding to label $i$ from all the samples constitute a
low-rank matrix $L^i_{\Omega_i}$, wherein $\Omega_i$ is the index set of samples
with label $i$. Thus the row space of $L^i_{\Omega_i}$ is the subspace $C^i$. In
the prediction stage of SDGS, given a new sample $x$, we apply group lasso to
find the group sparse representation $\beta$ on the multi-subspace $C$, and then
a simple thresholding is used to test which groups $\beta$ concentrates on. The
labels that these groups correspond to are the predicted labels for the sample
$x$.

In the training stage, the label correlations and structure are naturally
preserved in their mappings $C^i$. In the prediction stage, both discriminative
and structured information encoded in labels are considered via group lasso.
Therefore, SDGS explores label correlations without increasing the problem size.



3 Training: Structured Decomposition 



In this section, we introduce the training stage of SDGS, which approximately
decomposes the data matrix $X \in \mathbb{R}^{n \times p}$ into
$X = \sum_{i=1}^{k} L^i + S$. For the matrix $L^i$, the rows corresponding to the
samples with label $i$ are nonzero, while the other rows are all-zero vectors.
The nonzero rows are the components caused by label $i$ in the samples. We use
$\Omega_i$ to denote the index set of samples with label $i$ in the matrices $X$
and $L^i$, and then the matrix composed of the nonzero rows in $L^i$ is
represented by $L^i_{\Omega_i}$. In the decomposition, the rank of
$L^i_{\Omega_i}$ is upper bounded, which indicates that all the components caused
by label $i$ nearly lie in a linear subspace. The matrix $S$ is the residual of
the samples that cannot be explained by the given labels. In the decomposition,
the cardinality of $S$ is upper bounded, which makes $S$ sparse.

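If the label matrix of $X$ is $Y \in \{0, 1\}^{n \times k}$, the rank of
$L^i_{\Omega_i}$ is at most $r^i$ and the cardinality of $S$ is at most $K$, the
decomposition can be written as the following constrained minimization problem:

$$\min_{L^i, S} \left\| X - \sum_{i=1}^{k} L^i - S \right\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}\left(L^i_{\Omega_i}\right) \le r^i,\ L^i_{\overline{\Omega}_i} = 0,\ \forall i = 1, \ldots, k; \quad \operatorname{card}(S) \le K. \tag{3}$$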

Therefore, each training sample in $X$ is decomposed as the sum of several
components, which respectively correspond to the labels that the sample belongs
to. SDGS separates these components from the original sample by building the
mapping of $Y$ in the feature space of $X$. For label $i$, we obtain its mapping
in the feature subspace as the row space of $L^i_{\Omega_i}$.



3.1 Alternating minimization 



Although the rank constraint on $L^i_{\Omega_i}$ and the cardinality constraint
on $S$ are not convex, the optimization in (3) can be solved by alternating
minimization that decomposes it into the following $k + 1$ subproblems, each of
which has a global solution:

$$L^i_{\Omega_i} = \arg\min_{\operatorname{rank}(L^i_{\Omega_i}) \le r^i} \left\| X - \sum_{j=1, j \neq i}^{k} L^j - S - L^i \right\|_F^2, \quad \forall i = 1, \ldots, k;$$
$$S = \arg\min_{\operatorname{card}(S) \le K} \left\| X - \sum_{j=1}^{k} L^j - S \right\|_F^2. \tag{4}$$

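The solutions of $L^i_{\Omega_i}$ and $S$ in the above subproblems can be
obtained via hard thresholding of the singular values and of the entries,
respectively. Note that both SVD and matrix hard thresholding have global
solutions. In particular, $L^i_{\Omega_i}$ is built from the $r^i$ largest
singular values and the corresponding singular vectors of
$\left(X - \sum_{j=1, j \neq i}^{k} L^j - S\right)_{\Omega_i}$, while $S$ is
built from the $K$ entries with the largest absolute value in
$X - \sum_{j=1}^{k} L^j$, i.e.,

$$L^i_{\Omega_i} = \sum_{q=1}^{r^i} \lambda_q U_q V_q^T,\ i = 1, \ldots, k, \quad \operatorname{svd}\left[ \left( X - \sum_{j=1, j \neq i}^{k} L^j - S \right)_{\Omega_i} \right] = U \Lambda V^T;$$
$$S = \mathcal{P}_{\Phi}\left( X - \sum_{j=1}^{k} L^j \right), \quad \Phi : \left| \left( X - \sum_{j=1}^{k} L^j \right)_{r,s \in \Phi} \right| \neq 0 \ \text{and} \ \ge \left| \left( X - \sum_{j=1}^{k} L^j \right)_{r,s \in \overline{\Phi}} \right|,\ |\Phi| \le K. \tag{5}$$

The projection $S = \mathcal{P}_{\Phi}(R)$ represents that the matrix $S$ has the
same entries as $R$ on the index set $\Phi$, while the other entries are all
zeros.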
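
The two hard-thresholding operators used in (5) can be sketched in a few lines. The function names below are illustrative assumptions, and the rank truncation uses an exact SVD rather than the accelerated scheme of Section 3.2.

```python
import numpy as np

def truncate_rank(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, sing, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * sing[:r]) @ Vt[:r]

def keep_largest_entries(R, K):
    """P_Phi(R): keep the K largest-magnitude entries of R, zero the rest."""
    S = np.zeros_like(R)
    K = min(K, R.size)
    if K > 0:
        idx = np.argpartition(np.abs(R).ravel(), -K)[-K:]
        S.flat[idx] = R.flat[idx]
    return S

R = np.array([[3.0, -1.0],
              [0.5, -4.0]])
S_example = keep_largest_entries(R, 2)                 # keeps 3.0 and -4.0
L_example = truncate_rank(np.diag([3.0, 2.0, 1.0]), 2)
```

In the training loop, `truncate_rank` would be applied to the rows indexed by $\Omega_i$ of the current residual, and `keep_largest_entries` to the residual left after subtracting all $L^j$.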

The decomposition is then obtained by iteratively solving these $k + 1$
subproblems in (4) according to (5). In this paper, we initialize
$L^i_{\Omega_i}$ and $S$ as

$$L^i_{\Omega_i} := Z_{\Omega_i},\ i = 1, \ldots, k, \quad Z = D^{-1} X,\ D = \operatorname{diag}(Y \mathbf{1}); \quad S := 0. \tag{6}$$

In each subproblem, only one variable is optimized with the other variables
fixed. The convergence of this alternating minimization is proved in Theorem 1 by
demonstrating that the approximation error keeps monotonically decreasing
throughout the algorithm.

Theorem 1. The alternating minimization of subproblems (4) produces a sequence of
$\left\| X - \sum_{i=1}^{k} L^i - S \right\|_F^2$ that converges to a local
minimum.

Proof. Let the objective value (decomposition error)
$\left\| X - \sum_{i=1}^{k} L^i - S \right\|_F^2$ after solving the $k + 1$
subproblems in (4) be $E^1_{(t)}, \ldots, E^{k+1}_{(t)}$ respectively for the
$t$-th iteration round. We use the subscript $(t)$ to signify the variable that
is updated in the $t$-th iteration round. Then $E^1_{(t)}, \ldots, E^{k+1}_{(t)}$
are

$$E^1_{(t)} = \left\| X - S_{(t-1)} - L^1_{(t)} - \sum_{i=2}^{k} L^i_{(t-1)} \right\|_F^2, \tag{7}$$
$$E^2_{(t)} = \left\| X - S_{(t-1)} - L^1_{(t)} - L^2_{(t)} - \sum_{i=3}^{k} L^i_{(t-1)} \right\|_F^2, \tag{8}$$
$$\vdots$$
$$E^k_{(t)} = \left\| X - \sum_{i=1}^{k} L^i_{(t)} - S_{(t-1)} \right\|_F^2, \tag{9}$$
$$E^{k+1}_{(t)} = \left\| X - \sum_{i=1}^{k} L^i_{(t)} - S_{(t)} \right\|_F^2. \tag{10}$$

The global optimality of $L^i_{(t)}$ yields
$E^1_{(t)} \ge E^2_{(t)} \ge \cdots \ge E^k_{(t)}$. The global optimality of
$S_{(t)}$ yields $E^k_{(t)} \ge E^{k+1}_{(t)}$. In addition, we have

$$E^{k+1}_{(t)} = \left\| X - \sum_{i=2}^{k} L^i_{(t)} - S_{(t)} - L^1_{(t)} \right\|_F^2, \tag{11}$$
$$E^1_{(t+1)} = \left\| X - \sum_{i=2}^{k} L^i_{(t)} - S_{(t)} - L^1_{(t+1)} \right\|_F^2. \tag{12}$$

The global optimality of $L^1_{(t+1)}$ yields $E^{k+1}_{(t)} \ge E^1_{(t+1)}$.
Therefore, the objective value (or the decomposition error)
$\left\| X - \sum_{i=1}^{k} L^i - S \right\|_F^2$ keeps decreasing throughout the
iteration rounds of (5), i.e.,

$$E^1_{(1)} \ge E^{k+1}_{(1)} \ge \cdots \ge E^1_{(t)} \ge E^{k+1}_{(t)} \ge \cdots \tag{13}$$

Since the objective value of (3) is monotonically decreasing, is bounded below by
zero, and the constraints are satisfied all the time, iteratively solving (4)
produces a sequence of objective values that converge to a local minimum. This
completes the proof.
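
The monotone behaviour used in the proof can be observed numerically with a small exact-SVD version of the alternating minimization. This is a sketch under assumptions: the data are random, the function name is illustrative, and the check starts at the end of the first round, because in this sketch the starting point from (6) generally does not satisfy the rank constraints.

```python
import numpy as np

def alternating_decomposition(X, Y, ranks, K, n_rounds=5):
    """Exact-SVD alternating minimization for (3); records the error after every update."""
    k = Y.shape[1]
    omegas = [np.flatnonzero(Y[:, i]) for i in range(k)]
    # Initialization (6): split each sample evenly among its labels.
    Z = X / np.maximum(Y.sum(axis=1, keepdims=True), 1)
    L = [np.zeros_like(X) for _ in range(k)]
    for i, om in enumerate(omegas):
        L[i][om] = Z[om]
    S = np.zeros_like(X)
    errors = []
    for _ in range(n_rounds):
        for i, om in enumerate(omegas):
            N = (X - sum(L) + L[i] - S)[om]          # residual rows for label i
            U, sing, Vt = np.linalg.svd(N, full_matrices=False)
            r = ranks[i]
            L[i][om] = (U[:, :r] * sing[:r]) @ Vt[:r]
            errors.append(np.linalg.norm(X - sum(L) - S) ** 2)
        R = X - sum(L)
        S = np.zeros_like(X)
        top = np.argsort(np.abs(R), axis=None)[-K:]  # K largest-magnitude entries
        S.flat[top] = R.flat[top]
        errors.append(np.linalg.norm(X - sum(L) - S) ** 2)
    return L, S, np.array(errors)

rng = np.random.default_rng(0)
n, p, k = 30, 12, 3
Y = (rng.random((n, k)) < 0.5).astype(int)
Y[Y.sum(axis=1) == 0, 0] = 1                         # every sample gets a label
X = rng.standard_normal((n, p))
L, S, errors = alternating_decomposition(X, Y, ranks=[2, 2, 2], K=20)
tail = errors[k:]                                    # from the end of round 1 on
monotone = bool(np.all(np.diff(tail) <= 1e-8))
```

Each entry of `errors` corresponds to one of the values $E^1_{(t)}, \ldots, E^{k+1}_{(t)}$, so `tail` is the chain in (13) from the point where every variable is feasible.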

After obtaining the decomposition by solving (3), each training sample is
represented by the sum of several components in $L^i$ characterized by the labels
it belongs to and the residual in $S$. Therefore, the mapping of label $i$ in the
feature subspace is defined as the row space
$C^i \in \mathbb{R}^{r^i \times p}$ of the matrix $L^i_{\Omega_i}$, which can be
obtained via the QR decomposition of $(L^i_{\Omega_i})^T$.
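
As an illustration of this last step, the sketch below extracts an orthonormal basis of the row space from a stand-in low-rank matrix. The random matrix plays the role of $L^i_{\Omega_i}$, and keeping only the first $r$ columns of $Q$ assumes that the first $r$ rows of that matrix are linearly independent, which holds generically for this toy data.

```python
import numpy as np

rng = np.random.default_rng(1)
r, p, n_i = 3, 15, 40
# Stand-in for L^i restricted to Omega_i: a random rank-r matrix of size n_i x p.
L_omega = rng.standard_normal((n_i, r)) @ rng.standard_normal((r, p))

Q, R = np.linalg.qr(L_omega.T)      # (L_omega)^T = Q R
C_i = Q[:, :r].T                    # r orthonormal rows spanning the row space

# Projecting the rows of L_omega onto C_i should reproduce L_omega.
reconstruction = L_omega @ C_i.T @ C_i
```

Because the rows of `C_i` are orthonormal, the coefficients of any component $l^i$ in this basis are simply `l_i @ C_i.T`.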



3.2 Accelerate SDGS via bilateral random projections 

The main computation in (5) is the $k$ SVDs needed to obtain $L^i_{\Omega_i}$
($i = 1, \ldots, k$). SVD requires $\min(mn^2, m^2 n)$ flops for an $m \times n$
matrix, and thus it is impractical when $X$ is large. Random projection is
effective in accelerating matrix multiplication and decomposition [15]. In this
paper, we introduce "bilateral random projections (BRP)", which is a direct
extension of random projection, to accelerate the optimization of
$L^i_{\Omega_i}$ ($i = 1, \ldots, k$).

For clarity, we use letters independent of the ones used in other parts of this
paper to illustrate BRP. In particular, given $r$ bilateral random projections
(BRP) of an $m \times n$ dense matrix $X$ (w.l.o.g., $m \ge n$), i.e.,
$Y_1 = X A_1$ and $Y_2 = X^T A_2$, wherein $A_1 \in \mathbb{R}^{n \times r}$ and
$A_2 \in \mathbb{R}^{m \times r}$ are random matrices,

$$L = Y_1 \left( A_2^T Y_1 \right)^{-1} Y_2^T \tag{14}$$

is a fast rank-$r$ approximation of $X$. The computation of $L$ includes an
inverse of an $r \times r$ matrix and three matrix multiplications. Thus, for a
dense $X$, $2mnr$ floating-point operations (flops) are required to obtain BRP,
and $r^2(2n + r) + mnr$ flops are required to obtain $L$. The computational cost
is much less than that of the SVD based approximation.
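
A minimal sketch of the closed form (14), assuming Gaussian random matrices and a toy input that is exactly rank $r$ (in which case the approximation is exact up to rounding), might look as follows. The function name is an illustrative choice, and a linear solve replaces the explicit inverse.

```python
import numpy as np

def brp_approx(X, r, rng):
    """Rank-r approximation L = Y1 (A2^T Y1)^{-1} Y2^T from two random projections."""
    m, n = X.shape
    A1 = rng.standard_normal((n, r))
    A2 = rng.standard_normal((m, r))
    Y1 = X @ A1             # right random projection, m x r
    Y2 = X.T @ A2           # left random projection,  n x r
    return Y1 @ np.linalg.solve(A2.T @ Y1, Y2.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 4)) @ rng.standard_normal((4, 30))   # exactly rank 4
L = brp_approx(X, r=4, rng=rng)
rel_err = np.linalg.norm(X - L) / np.linalg.norm(X)
```

Only products with thin $m \times r$ and $n \times r$ matrices and one $r \times r$ solve are involved, which is where the saving over a full SVD comes from.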

We build the random matrices $A_1$ and $A_2$ in an adaptive way. Initially, both
$A_1$ and $A_2$ are set to standard Gaussian matrices whose entries are
independent variables following standard normal distributions. We first compute
$Y_1 = X A_1$, update $A_2 := Y_1$ and calculate the left random projection as
$Y_2 = X^T A_2$ by using the new $A_2$, and then we update $A_1 := Y_2$ and
calculate the right random projection $Y_1 = X A_1$ by using the new $A_1$. This
adaptive updating of random matrices requires additional $mnr$ flops.
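
The adaptive update can be sketched by chaining the three products in the order described above. The test matrix (low rank plus small noise), the noise level and the function name are assumptions made for illustration only.

```python
import numpy as np

def adaptive_brp_approx(X, r, rng):
    """BRP with adaptively updated random matrices, following the update order in the text."""
    m, n = X.shape
    A1 = rng.standard_normal((n, r))
    Y1 = X @ A1
    A2 = Y1                 # A2 := Y1
    Y2 = X.T @ A2           # left projection with the new A2
    A1 = Y2                 # A1 := Y2
    Y1 = X @ A1             # right projection with the new A1
    return Y1 @ np.linalg.solve(A2.T @ Y1, Y2.T)

rng = np.random.default_rng(0)
signal = rng.standard_normal((60, 4)) @ rng.standard_normal((4, 30))
X = signal + 0.01 * rng.standard_normal((60, 30))    # approximately rank 4
L = adaptive_brp_approx(X, r=4, rng=rng)
rel_err = np.linalg.norm(X - L) / np.linalg.norm(X)
```

With these updates the result equals the projection of the rows of $X$ onto the span of $X^T X A_1$, so the random matrix is steered toward the dominant row space of $X$ at the cost of one extra multiplication.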

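Algorithm 1 summarizes the training stage of SDGS with BRP based acceleration.

Algorithm 1: SDGS Training

Input: $X$, $\Omega_i$, $r^i$, $i = 1, \ldots, k$, $K$, $\epsilon$
Output: $C^i$, $i = 1, \ldots, k$

Initialize $L^i$ and $S$ according to (6), $t := 0$;
while $\left\| X - \sum_{j=1}^{k} L^j - S \right\|_F^2 > \epsilon$ do
    $t := t + 1$;
    for $i \leftarrow 1$ to $k$ do
        $N := \left( X - \sum_{j=1, j \neq i}^{k} L^j - S \right)_{\Omega_i}$;
        Generate standard Gaussian matrix $A_1 \in \mathbb{R}^{p \times r^i}$;
        $Y_1 := N A_1$, $A_2 := Y_1$;
        $Y_2 := N^T Y_1$, $Y_1 := N Y_2$;
        $L^i_{\Omega_i} := Y_1 \left( A_2^T Y_1 \right)^{-1} Y_2^T$, $L^i_{\overline{\Omega}_i} := 0$;
    end
    $N := X - \sum_{j=1}^{k} L^j$;
    $S := \mathcal{P}_{\Phi}(N)$, $\Phi$ is the index set of the $K$ largest entries of $|N|$;
end
QR decomposition $(L^i_{\Omega_i})^T = Q^i R^i$ for $i = 1, \ldots, k$, $C^i := (Q^i)^T$;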
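
Putting the pieces together, Algorithm 1 could be sketched as below. Several details are assumptions of this sketch rather than part of the algorithm as stated: the `max_rounds` safeguard, the use of a least-squares solve in place of the explicit inverse, the function name, and keeping the first $r^i$ columns of $Q^i$ as the basis.

```python
import numpy as np

def sdgs_train(X, Y, ranks, K, eps=1e-3, max_rounds=50, seed=0):
    """Sketch of SDGS training with BRP; returns one basis C^i per label."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = Y.shape[1]
    omegas = [np.flatnonzero(Y[:, i]) for i in range(k)]
    Z = X / np.maximum(Y.sum(axis=1, keepdims=True), 1)   # initialization (6)
    L = [np.zeros((n, p)) for _ in range(k)]
    for i, om in enumerate(omegas):
        L[i][om] = Z[om]
    S = np.zeros((n, p))
    for _ in range(max_rounds):                           # safeguard (assumption)
        for i, om in enumerate(omegas):
            N = (X - sum(L) + L[i] - S)[om]
            A1 = rng.standard_normal((p, ranks[i]))
            Y1 = N @ A1
            A2 = Y1
            Y2 = N.T @ Y1
            Y1 = N @ Y2
            # Least-squares solve stands in for (A2^T Y1)^{-1} Y2^T.
            L[i][om] = Y1 @ np.linalg.lstsq(A2.T @ Y1, Y2.T, rcond=None)[0]
        N = X - sum(L)
        S = np.zeros((n, p))
        top = np.argsort(np.abs(N), axis=None)[-K:]       # K largest |entries|
        S.flat[top] = N.flat[top]
        if np.linalg.norm(X - sum(L) - S) ** 2 <= eps:
            break
    # Row-space basis of each L^i on Omega_i via QR of its transpose.
    return [np.linalg.qr(L[i][om].T)[0][:, :ranks[i]].T
            for i, om in enumerate(omegas)]

rng = np.random.default_rng(0)
n, p, k = 40, 15, 3
Y = (rng.random((n, k)) < 0.5).astype(int)
Y[Y.sum(axis=1) == 0, 0] = 1
X = rng.standard_normal((n, p))
C_list = sdgs_train(X, Y, ranks=[2, 2, 2], K=30, max_rounds=5)
```

The returned blocks can be stacked with `np.vstack(C_list)` to form the dictionary $C$ used in the prediction stage.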
4 Prediction: Group Sparsity

In this section, we introduce the prediction stage of SDGS, which estimates a
group sparse representation of a given sample. In the training stage, we
decompose the training data into the sum of low-rank components $L^i_{\Omega_i}$
characterized by their labels and a sparse residual $S$. The mapping of label $i$
in the feature subspace is defined as the row space $C^i$ of $L^i_{\Omega_i}$,
because the components of the training data characterized by label $i$ lie in the
linear subspace $C^i$.

In the prediction stage of SDGS, we use group lasso [14] to estimate the group
sparse representation $\beta \in \mathbb{R}^{\sum_{i=1}^{k} r^i}$ of a test
sample $x \in \mathbb{R}^p$ on the multi-subspace $C = [C^1; \ldots; C^k]$,
wherein the $k$ groups are defined as the index sets of the coefficients
corresponding to $C^1, \ldots, C^k$. Since group lasso selects nonzero
coefficients group-wise, the nonzero coefficients in the group sparse
representation will concentrate on the groups corresponding to the labels that
the sample belongs to.

According to the above analysis, we solve the following group lasso problem in
the prediction of SDGS:

$$\min_{\beta} \frac{1}{2} \left\| x - \beta C \right\|_2^2 + \lambda \sum_{i=1}^{k} \left\| \beta_{G_i} \right\|_2, \tag{15}$$

where the index set $G_i$ includes all the integers between
$1 + \sum_{j=1}^{i-1} r^j$ and $\sum_{j=1}^{i} r^j$ (including these two
numbers).
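
One simple way to solve a problem of the form (15) is proximal gradient descent with group soft-thresholding. This is a generic solver swapped in for illustration, not the SLEP package used in the experiments. The toy dictionary below has mutually orthonormal rows (an assumption made only so that the solution is known in closed form), and the objective is taken with a factor 1/2 on the squared error.

```python
import numpy as np

def group_lasso(x, C, groups, lam, n_iter=50):
    """Proximal gradient for 0.5*||x - beta C||^2 + lam * sum_i ||beta_{G_i}||_2."""
    step = 1.0 / np.linalg.norm(C, 2) ** 2        # 1 / Lipschitz constant
    beta = np.zeros(C.shape[0])
    for _ in range(n_iter):
        z = beta - step * ((beta @ C - x) @ C.T)  # gradient step
        for g in groups:                          # group soft-thresholding
            norm_g = np.linalg.norm(z[g])
            if norm_g > 0:
                z[g] = max(0.0, 1.0 - step * lam / norm_g) * z[g]
        beta = z
    return beta

rng = np.random.default_rng(0)
p, ranks = 20, [2, 3, 2]
C = np.linalg.qr(rng.standard_normal((p, sum(ranks))))[0].T   # orthonormal rows
offsets = np.concatenate(([0], np.cumsum(ranks)))
groups = [np.arange(offsets[i], offsets[i + 1]) for i in range(len(ranks))]

beta_true = np.zeros(sum(ranks))
beta_true[groups[0]] = 1.0
beta_true[groups[2]] = 1.0
x = beta_true @ C
lam = 0.2
beta = group_lasso(x, C, groups, lam)
shrink = 1.0 - lam / np.sqrt(2.0)     # closed-form shrinkage of the active groups
```

For this orthonormal toy case the inactive group is set exactly to zero and each active group is shrunk by the factor `shrink`. With a learned dictionary the blocks are generally correlated and the solver needs more iterations.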

To obtain the final prediction of the label vector $y \in \{0, 1\}^k$ for the
test sample $x$, we use a simple thresholding of the magnitude sum of the
coefficients in each group to test which groups the sparse coefficients in
$\beta$ concentrate on:

$$y_{\Psi} = 1,\ y_{\overline{\Psi}} = 0,\ \Psi = \{ i : \| \beta_{G_i} \|_1 \ge \delta \}. \tag{16}$$

Although $y$ can also be obtained via selecting the groups with nonzero
coefficients when $\lambda$ in (15) is chosen properly, we set the threshold
$\delta$ as a small positive value to guarantee the robustness to $\lambda$.
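
The thresholding rule in (16) amounts to one comparison per group. In the sketch below the coefficient vector, the groups and the threshold are made-up values, and a group is selected when its magnitude sum is at or above the threshold.

```python
import numpy as np

def predict_labels(beta, groups, delta):
    """Eq. (16): y_i = 1 when the l1 norm of beta on group G_i reaches delta."""
    return np.array([int(np.abs(beta[g]).sum() >= delta) for g in groups])

beta = np.array([0.8, -0.3, 0.0, 0.001, 0.0, 0.5])
groups = [np.arange(0, 2), np.arange(2, 5), np.arange(5, 6)]
y_pred = predict_labels(beta, groups, delta=0.01)    # group sums: 1.1, 0.001, 0.5
```

The second group carries a small spurious coefficient (0.001) that a test for exact nonzeros would pick up, whereas the threshold of 0.01 discards it.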

Algorithm 2 summarizes the prediction stage of SDGS.

Algorithm 2: SDGS Prediction

Input: $x$, $C^i$, $i = 1, \ldots, k$, $\lambda$, $\delta$
Output: $y$

Solve the group lasso problem in (15) to obtain $\beta$;
Predict $y$ via the thresholding in (16);



5 Experiments 

In this section, we evaluate SDGS on several datasets of text classification,
image annotation, scene classification, music categorization, genomics and web
page classification. We compare SDGS with BR [2], ML-kNN [10] and MDDM [11] on
five evaluation metrics for evaluating the effectiveness, as well as the CPU
seconds for evaluating the efficiency. All the experiments are run in MATLAB on a
server with dual quad-core 3.33 GHz Intel Xeon processors and 32 GB RAM.



5.1 Evaluation metrics



In the experiments of multi-label prediction, five metrics, which are Hamming
loss, precision, recall, F1 score and accuracy, are used to measure the
prediction performance.

Given two label matrices $Y1, Y2 \in \{0, 1\}^{n \times k}$, wherein $Y1$ is the
real one and $Y2$ is the predicted one, the Hamming loss measures the recovery
error rate:

$$\mathrm{HamL} = \frac{1}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} Y1_{i,j} \oplus Y2_{i,j}, \tag{17}$$

where $\oplus$ is the XOR operation, a.k.a. the exclusive disjunction.
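
As a quick worked example of (17), with two made-up label matrices:

```python
import numpy as np

def hamming_loss(Y1, Y2):
    """Eq. (17): fraction of label entries where truth and prediction differ."""
    return float(np.mean(np.logical_xor(Y1, Y2)))

Y1 = np.array([[1, 0, 1],
               [0, 1, 0]])      # ground truth
Y2 = np.array([[1, 1, 1],
               [0, 0, 0]])      # prediction
loss = hamming_loss(Y1, Y2)     # 2 mismatches out of 6 entries
```

The two matrices differ in one entry per row, so the loss is 2/6.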

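The other four metrics are precision, recall, F1 score and accuracy, and are
defined as:

$$\mathrm{Prec} = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{card}(Y1_i \cap Y2_i)}{\operatorname{card}(Y2_i)}, \tag{18}$$
$$\mathrm{Rec} = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{card}(Y1_i \cap Y2_i)}{\operatorname{card}(Y1_i)}, \tag{19}$$
$$\mathrm{F1} = \frac{1}{n} \sum_{i=1}^{n} \frac{2 \operatorname{card}(Y1_i \cap Y2_i)}{\operatorname{card}(Y1_i) + \operatorname{card}(Y2_i)}, \tag{20}$$
$$\mathrm{Acc} = \frac{1}{n} \sum_{i=1}^{n} \frac{\operatorname{card}(Y1_i \cap Y2_i)}{\operatorname{card}(Y1_i \cup Y2_i)}. \tag{21}$$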
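
The four sample-averaged metrics in (18)-(21) can be computed together. The sketch below treats a term with a zero denominator as 0, which is an assumption, since the definitions above do not say how empty label sets are handled.

```python
import numpy as np

def example_based_metrics(Y1, Y2):
    """Eqs. (18)-(21), averaged over samples; 0/0 terms count as 0 (assumption)."""
    Y1, Y2 = Y1.astype(bool), Y2.astype(bool)
    inter = (Y1 & Y2).sum(axis=1).astype(float)
    union = (Y1 | Y2).sum(axis=1)
    n_true, n_pred = Y1.sum(axis=1), Y2.sum(axis=1)

    def ratio(num, den):
        out = np.zeros_like(num)
        return float(np.divide(num, den, out=out, where=den > 0).mean())

    return {
        "precision": ratio(inter, n_pred),
        "recall": ratio(inter, n_true),
        "f1": ratio(2 * inter, n_true + n_pred),
        "accuracy": ratio(inter, union),
    }

Y1 = np.array([[1, 0, 1],
               [0, 1, 0]])      # ground truth
Y2 = np.array([[1, 1, 1],
               [0, 0, 0]])      # prediction
metrics = example_based_metrics(Y1, Y2)
```

For the first sample the intersection has 2 labels against 3 predicted and 2 true, and the second sample has an empty prediction, so the averages are 1/3 (precision), 1/2 (recall), 2/5 (F1) and 1/3 (accuracy).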



5.2 Datasets 



We evaluate the prediction performance and time cost of SDGS on 11 datasets from
different domains and of different scales, including Corel5k (image), Mediamill
(video), Enron (text), Genbase (genomics), Medical (text), Emotions (music),
Slashdot (text) and 4 sub-datasets selected from the Yahoo dataset (web data).
These datasets were obtained from Mulan's website 1 and MEKA's website 2. They
were collected from different practical problems. Table 1 shows the number of
samples n (training samples + test samples), number of features p, number of
labels k, and the average cardinality of all label vectors Card of the different
datasets.



1 http://mulan.sourceforge.net/datasets.html
2 http://meka.sourceforge.net/

Table 1. Information of the datasets that are used in the experiments of SDGS.
In the table, n (training samples + test samples) is the number of samples, p is
the number of features, k is the number of labels, and "Card" is the average
cardinality of all label vectors.

Datasets            n (train + test)    p      k      Card
Corel5k             4500 + 500          499    374    3.522
Mediamill           30993 + 12914       120    101    4.376
Enron               1123 + 579          1001   53     3.378
Genbase             463 + 199           1186   27     1.252
Medical             333 + 645           1449   45     1.245
Emotions            391 + 202           72     6      1.869
Slashdot            2338 + 1444         1079   22     1.181
Yahoo-Arts          2000 + 3000         462    26     1.636
Yahoo-Education     2000 + 3000         550    33     1.461
Yahoo-Recreation    2000 + 3000         606    22     1.423
Yahoo-Science       2000 + 3000         743    40     1.451

5.3 Performance comparison 

We show the prediction performance and time cost in CPU seconds of BR, ML-kNN,
MDDM and SDGS in Table 3 and Table 2. In BR, we use the MATLAB interface of
LIBSVM 3.0 3 to train the classic linear SVM classifiers for each label. The
parameter $C \in \{10^{-3}, 10^{-2}, 0.1, 1, 10, 10^{2}, 10^{3}\}$ with the best
performance was used. In ML-kNN, the number of neighbors was 30 for all the
datasets.



Table 2. Prediction performances (%) and CPU seconds of BR [2], ML-kNN [10],
MDDM [11] and SDGS on Yahoo.

Dataset      Methods   Hamming loss   Precision   Recall   F1 score   Accuracy   CPU seconds
Arts         BR        5              76          25       26         24         46.8
             ML-kNN    6              62          7        25         6          77.6
             MDDM      6              68          6        21         5          37.4
             SDGS      9              35          40       31         28         11.7
Education    BR        4              69          27       28         26         50.1
             ML-kNN    4              58          6        31         5          99.8
             MDDM      4              59          5        26         5          45.2
             SDGS      4              41          35       32         29         12.6
Recreation   BR        5              84          23       23         22         53.2
             ML-kNN    6              70          9        23         8          112
             MDDM      6              66          7        18         6          41.9
             SDGS      7              41          49       36         30         19.1
Science      BR        3              79          19       19         19         84.9
             ML-kNN    3              59          4        20         4          139
             MDDM      3              66          4        19         4          53.0
             SDGS      5              31          39       29         26         20.1

3 http://www.csie.ntu.edu.tw/~cjlin/libsvm/

In MDDM, the regularization parameter for uncorrelated subspace dimensionality
reduction was selected as 0.12 and the dimension of the subspace was set as 20%
of the dimension of the original data. In SDGS, we selected $r^i$ as an integer
in $[1, 6]$, $K \in [10^{-6}, 10^{-3}]$, $\lambda \in [0.2, 0.45]$ and
$\delta \in [10^{-4}, 10^{-2}]$. We roughly selected 4 groups of parameters in
the ranges for each dataset and chose the one with the best performance on the
training data. Group lasso in SDGS can be solved by many convex optimization
methods, e.g., submodular optimization [16] and SLEP [17]. We use SLEP in our
experiments.

The experimental results show that SDGS is competitive on both prediction perfor- 
mance and speed, because it explores label correlations and structure without increas- 
ing the problem size. In addition, the bilateral random projections further accelerate the 
computation. SDGS has smaller gaps between precision and recall on different tasks 
than other methods, and this implies it is robust to the imbalance between positive and 
negative samples. 

6 Conclusion 

In this paper, we propose a novel multi-label learning method "Structured
Decomposition + Group Sparsity (SDGS)". Its training stage decomposes the
training data as the sum of several low-rank components $L^i_{\Omega_i}$
corresponding to their labels and a sparse residual $S$ that cannot be explained
by the given labels. This structured decomposition is accomplished by a bilateral
random projections based alternating minimization, and it converges to a local
minimum. The row space $C^i$ of $L^i_{\Omega_i}$ is the mapping of label $i$ in
the feature subspace. The prediction stage estimates the group sparse
representation of a new sample on the multi-subspace $C^i$ via group lasso. SDGS
predicts the labels by selecting the groups that the nonzero representation
coefficients concentrate on.

SDGS finds the mapping of labels in the feature space, where the label correlations 
are naturally preserved in the corresponding mappings. Thus it explores the label struc- 
ture without increasing the problem size. SDGS is robust to the imbalance between 
positive and negative samples, because it uses group sparsity in the multi-subspace to 
select the labels, which considers both the discriminative and relative information be- 
tween the mappings of labels in feature subspace. 
Table 3. Prediction performances (%) and CPU seconds of BR [2], ML-kNN [10],
MDDM [11] and SDGS on 7 datasets.